library(topGO)
library(knitr)   # to use kable()
library(limma)   # to use vennDiagram()
library(ggplot2)
TAG <- params$TAG
VAR <- params$VAR
ANNOTATION <- list(genes = '../2019-07-26/genes/annotation.txt',
                   isoforms = '../2019-07-26/transcripts/annotation.txt')
TAGDIR <- paste0('../2020-01-08/', TAG, '/')

Introduction

This is the enrichment analysis for genes regulated by hatching. Because I quantified the association of expression with hatching in two different ways, this document has two sections. Section Variance reports the enrichment analysis performed with genes ordered by the proportion of expression variance explained by hatching. Note that some genes with low variance may still be highly associated with hatching, even if the fold change between levels of this factor is low. Section Differential expression uses an ordering of genes based on the significance of the differential expression between levels of hatching, which does depend on fold change.

Reading the data

Functional annotation is in 2019-07-26. I will also upload two lists of genes, with either proportion of variance explained by hatching or p-value of differential expression test.

tagVariance <- read.table(paste0(TAGDIR, VAR, '_variance.txt'))
tagPValue   <- read.table(paste0(TAGDIR, VAR, '_pvalue.txt'))
annotation  <- read.table(ANNOTATION[[TAG]], col.names = c('tagname', 'goterms'))

To initialize the topGOdata object, I will need the gene list as a named numeric vector, where the names are the genes identifiers and the numeric values, either the portion of variance explained by hatching or the p-values of the differential expression test. The structure() function adds attributes to an object.

Variance <- structure(tagVariance[,1], names = row.names(tagVariance))
PValues  <- structure(tagPValue[,1],   names = row.names(tagPValue))
rm(tagVariance, tagPValue)

There are 18598 genes scored with a variance portion and a differential expression p-value. It should actually be the exact same genes. The annotation data frame has more than one GO term for every tag, separated by |. I need a named list.

head(annotation)
##       tagname                                                           goterms
## 1 XLOC_000002                                  GO:0008417|GO:0016020|GO:0006486
## 2 XLOC_000009                                  GO:0043130|GO:0005515|GO:0043161
## 3 XLOC_000010                       GO:0005840|GO:0015935|GO:0003735|GO:0006412
## 4 XLOC_000015 GO:0006355|GO:0003700|GO:0006357|GO:0043565|GO:0005634|GO:0032502
## 5 XLOC_000021                       GO:0016567|GO:0006397|GO:0061630|GO:0008270
## 6 XLOC_000036                                  GO:0005272|GO:0006814|GO:0016020
allgenes2GO <- strsplit(as.character(annotation$goterms), "|", fixed = TRUE)
names(allgenes2GO) <- annotation$tagname
rm(annotation)

There are 9691 genes with GO annotations. But the differential expression analysis includes many more genes. In order to include the not-annotated genes in the enrichment analysis, to see if annotated or not annotated genes are more or less often differentially expressed, I should attribute a GO term to them. According to [http://geneontology.org/docs/faq/] nowadays we express lack of annotation by annotating to the root nodes, i.e. GO:0008150 biological_process, GO:0003674 molecular_function, and GO:0005575 cellular_component.

for (tag in unique(c(names(PValues), names(Variance)))) {
   if (! tag %in% names(allgenes2GO)) {
      allgenes2GO <- append(allgenes2GO,
         structure(list(c("GO:0008150", "GO:0003674", "GO:0005575")), names = tag))
   }
}

Using differential expression p-values

Building the topGO object

Creation of a topGO dataset is documented in section 4 of topGO’s the user manual: https://bioconductor.org/packages/release/bioc/vignettes/topGO/inst/doc/topGO.pdf. I need to use the new command, and fill up the slots. The annot object must be a function that compiles “a list of GO terms such that each element in the list is a character vector containing all the gene identifiers that are mapped to the respective GO term.” There are several options, that you can check using help(annFUN.gene2GO), for example. The annFUN.gene2GO requires a gene2GO argument, which is the list of gene-to-GO terms I made before. The geneSelectionFun object is a function that selects the interesting (most significant) genes. It is required to perform Fisher’s exact test. The nodeSize is used to prune the GO hierarchy from the terms which have less than n annotated genes.

I generate three datasets, to analyse each of the three ontologies.

selection <- function(allScores) {return(allScores < 0.05)}
GOdata.BP <- new('topGOdata',
   description = 'Differential expression among Brachionus plicatilis populations.',
   ontology = 'BP', 
   allGenes = PValues,
   annot = annFUN.gene2GO,
   gene2GO = allgenes2GO,
   geneSelectionFun = selection,
   nodeSize = 5)

GOdata.MF <- new('topGOdata',
   description = 'Differential expression among Brachionus plicatilis populations.',
   ontology = 'MF',
   allGenes = PValues,
   annot = annFUN.gene2GO,
   gene2GO = allgenes2GO,
   geneSelectionFun = selection,
   nodeSize = 5)

GOdata.CC <- new('topGOdata',
   description = 'Differential expression among Brachionus plicatilis populations.',
   ontology = 'CC',
   allGenes = PValues,
   annot = annFUN.gene2GO,
   gene2GO = allgenes2GO,
   geneSelectionFun = selection,
   nodeSize = 5)

DataSummary <- data.frame(ontology = c('BP', 'MF', 'CC'),
   Num_Genes = sapply(list(GOdata.BP, GOdata.MF, GOdata.CC), numGenes),
   Num_GO_terms = sapply(list(GOdata.BP, GOdata.MF, GOdata.CC), function(x) length(usedGO(x))))
kable(DataSummary, caption='Number of feasible genes or transcripts and number of GO terms used in each data set.')
rm(DataSummary)
Number of feasible genes or transcripts and number of GO terms used in each data set.
ontology Num_Genes Num_GO_terms
BP 15162 1154
MF 17429 576
CC 12940 291

Running the tests

There are more than one way to test for enrichment. Something that took me a while to understand is that not only there are different test statistics (Fisher’s exact test, Kolmogorov-Smirnov, and others) but also different algorithms: classic, elim, weight… The algorithms are ways to deal with the dependence structure among GO terms due to topology. Some algorithms are compatible with all statistics implemented in topGO. But the weight and the parentchild algorithms are only applicable to Fisher’s exact test. I am not interested in the classic algorithm, which treats GO nodes as independent, and therefore produces an excess of significant results. I will not use the Fisher’s exact test, because it dependes on an arbitrary threshold of significance on non-adjusted p-values.

BP.elim     <- runTest(GOdata.BP, algorithm = "elim",     statistic = "ks")
BP.weight01 <- runTest(GOdata.BP, algorithm = "weight01", statistic = "ks")
BP.lea      <- runTest(GOdata.BP, algorithm = "lea",      statistic = "ks")
MF.elim     <- runTest(GOdata.MF, algorithm = "elim",     statistic = "ks")
MF.weight01 <- runTest(GOdata.MF, algorithm = "weight01", statistic = "ks")
MF.lea      <- runTest(GOdata.MF, algorithm = "lea",      statistic = "ks")
CC.elim     <- runTest(GOdata.CC, algorithm = "elim",     statistic = "ks")
CC.weight01 <- runTest(GOdata.CC, algorithm = "weight01", statistic = "ks")
CC.lea      <- runTest(GOdata.CC, algorithm = "lea",      statistic = "ks")

ResultsSummary <- data.frame(ontology = rep(c("BP", "MF", "CC"), each = 3),
   algorithm = rep(c("elim", "weight01", "lea"), 3),
   TermsTested = sapply(list(BP.elim, BP.weight01, BP.lea, MF.elim, MF.weight01, MF.lea, CC.elim, CC.weight01, CC.lea), function(X) length(score(X))),
   Significant = sapply(list(BP.elim, BP.weight01, BP.lea, MF.elim, MF.weight01, MF.lea, CC.elim, CC.weight01, CC.lea), function(X) sum(score(X) < 0.01)))

kable(ResultsSummary, caption="Number of non-trivial terms tested and those with a score (not corrected p-value) lower than 0.01.")
Number of non-trivial terms tested and those with a score (not corrected p-value) lower than 0.01.
ontology algorithm TermsTested Significant
BP elim 1154 21
BP weight01 1154 12
BP lea 1154 51
MF elim 576 13
MF weight01 576 14
MF lea 576 22
CC elim 291 6
CC weight01 291 4
CC lea 291 9
rm(ResultsSummary)

Results

The topGO package does not pay much attention to what terms are significant because they are overrepresented and which ones are underrepresented among the most differentially expressed genes. I think it’s worth separating them, to facilitate the biological interpretation. Note that not all terms listed in the tables below are significant. The scores for the three methods (elim, weight01 and lea) are non-corrected p-values.

Biological process

orderedTerms <- names(sort(score(BP.weight01)))
significant.weight01 <- score(BP.weight01)[orderedTerms] <= 0.01
significant.lea      <- score(BP.lea)[orderedTerms] <= 0.01
significant.elim     <- score(BP.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
BP.pvalue.sigTerms <- sigTerms

BP.all <- GenTable(GOdata.BP, elim=BP.elim, weight01=BP.weight01, lea=BP.lea,
                   orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))

kable(
   BP.all[BP.all$Significant > BP.all$Expected,],
   caption = "Most over-represented terms of the Biological Process ontology.")
Most over-represented terms of the Biological Process ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
2 GO:0006486 protein glycosylation 75 5 2.96 4 0.0014 0.0011 0.00135
4 GO:1901264 carbohydrate derivative transport 11 2 0.43 199 0.1221 0.0029 0.12212
5 GO:0015931 nucleobase-containing compound transport 29 2 1.14 504 0.4185 0.0029 0.41849
7 GO:0034314 Arp2/3 complex-mediated actin nucleation 12 2 0.47 7 0.0036 0.0036 0.00357
9 GO:0072330 monocarboxylic acid biosynthetic process 24 2 0.95 600 0.5131 0.0045 0.51711
11 GO:0009968 negative regulation of signal transducti… 9 1 0.35 8 0.0036 0.0065 0.00362
13 GO:0003341 cilium movement 10 1 0.39 22 0.0112 0.0112 0.01125
14 GO:0001522 pseudouridine synthesis 13 1 0.51 23 0.0113 0.0113 0.01132
15 GO:0035435 phosphate ion transmembrane transport 8 1 0.32 25 0.0114 0.0114 0.01138
16 GO:0044341 sodium-dependent phosphate transport 8 1 0.32 26 0.0114 0.0114 0.01138
kable(
   BP.all[BP.all$Significant < BP.all$Expected,],
   caption = "Most under-represented terms of the Biological Process ontology.")
Most under-represented terms of the Biological Process ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
1 GO:0060271 cilium assembly 46 1 1.81 1 5.9e-07 6.4e-06 5.9e-07
3 GO:0046907 intracellular transport 202 6 7.97 448 0.3620 0.0015 0.36202
6 GO:0007264 small GTPase mediated signal transductio… 81 2 3.19 10 0.0039 0.0033 0.00393
8 GO:0006418 tRNA aminoacylation for protein translat… 40 0 1.58 11 0.0042 0.0042 0.00421
10 GO:0043604 amide biosynthetic process 215 5 8.48 713 0.6214 0.0046 0.50026
12 GO:0016579 protein deubiquitination 40 1 1.58 16 0.0067 0.0067 0.00674
17 GO:0006511 ubiquitin-dependent protein catabolic pr… 102 2 4.02 677 0.5920 0.0121 0.59197
18 GO:0006012 galactose metabolic process 6 0 0.24 33 0.0146 0.0146 0.01462
19 GO:0035556 intracellular signal transduction 173 6 6.82 18 0.0081 0.0160 0.00024
20 GO:0098813 nuclear chromosome segregation 27 1 1.06 465 0.3725 0.0174 0.37246
21 GO:1901135 carbohydrate derivative metabolic proces… 300 11 11.83 384 0.2854 0.0195 0.45408
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
                             lea=significant.lea,
                             elim=significant.elim)))

kable(data.frame(Term = Term(GOTERM[sigTerms]),
                 Definition = Definition(GOTERM[sigTerms]),
                 PValue=score(BP.weight01)[sigTerms]),
      caption = paste('Biological process terms significantly associated with',
                      VAR, 'according to all 3 algorithms', sep=' '))
Biological process terms significantly associated with hatching according to all 3 algorithms
Term Definition PValue
GO:0060271 cilium assembly The assembly of a cilium, a specialized eukaryotic organelle that consists of a filiform extrusion of the cell surface. Each cilium is bounded by an extrusion of the cytoplasmic membrane, and contains a regular longitudinal array of microtubules, anchored basally in a centriole. 0.0000064
GO:0006486 protein glycosylation A protein modification process that results in the addition of a carbohydrate or carbohydrate derivative unit to a protein amino acid, e.g. the addition of glycan chains to proteins. 0.0010653
GO:0007264 small GTPase mediated signal transduction Any series of molecular signals in which a small monomeric GTPase relays one or more of the signals. 0.0033357
GO:0034314 Arp2/3 complex-mediated actin nucleation The actin nucleation process in which actin monomers combine to form a new branch on the side of an existing actin filament; mediated by the Arp2/3 protein complex and its interaction with other proteins. 0.0035728
GO:0006418 tRNA aminoacylation for protein translation The synthesis of aminoacyl tRNA by the formation of an ester bond between the 3’-hydroxyl group of the most 3’ adenosine of the tRNA and the alpha carboxylic acid group of an amino acid, to be used in ribosome-mediated polypeptide synthesis. 0.0042091
GO:0009968 negative regulation of signal transduction Any process that stops, prevents, or reduces the frequency, rate or extent of signal transduction. 0.0065332
GO:0016579 protein deubiquitination The removal of one or more ubiquitin groups from a protein. 0.0067352

I think the GO graph is useful to see the relationship among the significant terms. But too large graphs are impossible to read. I don’t know how to split the graph in meaningful subgraphs.

showSigOfNodes(GOdata.BP, score(BP.weight01),
               firstSigNodes = sum(significant.elim),
               wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 179 
## Number of Edges = 335 
## 
## $complete.dag
## [1] "A graph with 179 nodes."

This is just a example of the most significant GO term:

showGroupDensity(GOdata.BP, names(score(BP.weight01))[which.min(score(BP.weight01))])

Molecular function

orderedTerms <- names(sort(score(MF.weight01)))
significant.weight01 <- score(MF.weight01)[orderedTerms] <= 0.01
significant.lea      <- score(MF.lea)[orderedTerms] <= 0.01
significant.elim     <- score(MF.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
MF.pvalue.sigTerms <- sigTerms

MF.all <- GenTable(GOdata.MF, elim=MF.elim, weight01=MF.weight01, lea=MF.lea,
                   orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))

kable(
   MF.all[MF.all$Significant > MF.all$Expected,],
   caption = "Most over-represented terms of the Molecular Function ontology.")
Most over-represented terms of the Molecular Function ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
2 GO:0008417 fucosyltransferase activity 33 4 1.27 3 0.00066 0.00066 0.00066
6 GO:0016757 transferase activity, transferring glyco… 194 10 7.46 7 0.00287 0.00285 1.7e-06
7 GO:1901505 carbohydrate derivative transmembrane tr… 22 2 0.85 306 0.48252 0.00289 0.48252
10 GO:0004842 ubiquitin-protein transferase activity 76 3 2.92 152 0.18382 0.00661 0.18382
11 GO:0005085 guanyl-nucleotide exchange factor activi… 58 3 2.23 60 0.04246 0.00691 0.04246
12 GO:0008061 chitin binding 77 5 2.96 11 0.00721 0.00721 0.00721
kable(
   MF.all[MF.all$Significant < MF.all$Expected,],
   caption = "Most under-represented terms of the Molecular Function ontology.")
Most under-represented terms of the Molecular Function ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
1 GO:0005515 protein binding 2101 67 80.77 1 3.4e-08 1.3e-10 1.0e-09
3 GO:0005509 calcium ion binding 361 11 13.88 4 0.00123 0.00123 0.00123
4 GO:0036459 thiol-dependent ubiquitinyl hydrolase ac… 43 1 1.65 28 0.01699 0.00229 0.01699
5 GO:0004812 aminoacyl-tRNA ligase activity 42 0 1.61 6 0.00231 0.00231 0.00231
8 GO:0008536 Ran GTPase binding 14 0 0.54 8 0.00322 0.00322 0.00322
9 GO:0004707 MAP kinase activity 6 0 0.23 9 0.00591 0.00591 0.00591
13 GO:0005524 ATP binding 786 19 30.22 12 0.00870 0.00870 0.00870
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
                             lea=significant.lea,
                             elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
                 Definition=Definition(GOTERM[sigTerms]),
                 PValue=score(MF.weight01)[sigTerms]),
      caption = paste('Molecular function terms significantly associated with', VAR,
                      'according to all 3 algorithms', sep=' '))
Molecular function terms significantly associated with hatching according to all 3 algorithms
Term Definition PValue
GO:0005515 protein binding Interacting selectively and non-covalently with any protein or protein complex (a complex of two or more proteins that may include other nonprotein molecules). 0.0000000
GO:0008417 fucosyltransferase activity Catalysis of the transfer of a fucosyl group to an acceptor molecule, typically another carbohydrate or a lipid. 0.0006623
GO:0005509 calcium ion binding Interacting selectively and non-covalently with calcium ions (Ca2+). 0.0012279
GO:0004812 aminoacyl-tRNA ligase activity Catalysis of the formation of aminoacyl-tRNA from ATP, amino acid, and tRNA with the release of diphosphate and AMP. 0.0023073
GO:0016757 transferase activity, transferring glycosyl groups Catalysis of the transfer of a glycosyl group from one compound (donor) to another (acceptor). 0.0028549
GO:0008536 Ran GTPase binding Interacting selectively and non-covalently with Ran, a conserved Ras-like GTP-binding protein, implicated in nucleocytoplasmic transport, cell cycle progression, spindle assembly, nuclear organization and nuclear envelope (NE) assembly. 0.0032245
GO:0004707 MAP kinase activity Catalysis of the reaction: protein + ATP = protein phosphate + ADP. This reaction is the phosphorylation of proteins. Mitogen-activated protein kinase; a family of protein kinases that perform a crucial step in relaying signals from the plasma membrane to the nucleus. They are activated by a wide range of proliferation- or differentiation-inducing signals; activation is strong with agonists such as polypeptide growth factors and tumor-promoting phorbol esters, but weak (in most cell backgrounds) by stress stimuli. 0.0059098
GO:0008061 chitin binding Interacting selectively and non-covalently with chitin, a linear polysaccharide consisting of beta-(1->4)-linked N-acetyl-D-glucosamine residues. 0.0072072
GO:0005524 ATP binding Interacting selectively and non-covalently with ATP, adenosine 5’-triphosphate, a universally important coenzyme and enzyme regulator. 0.0086997
showSigOfNodes(GOdata.MF, score(MF.weight01),
               firstSigNodes = sum(significant.elim),
               wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 59 
## Number of Edges = 75 
## 
## $complete.dag
## [1] "A graph with 59 nodes."
showGroupDensity(GOdata.MF, names(score(MF.weight01))[which.min(score(MF.weight01))])

Cellular component

orderedTerms <- names(sort(score(CC.weight01)))
significant.weight01 <- score(CC.weight01)[orderedTerms] <= 0.01
significant.lea      <- score(CC.lea)[orderedTerms] <= 0.01
significant.elim     <- score(CC.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
CC.pvalue.sigTerms <- sigTerms

CC.all <- GenTable(GOdata.CC, elim=CC.elim, weight01=CC.weight01, lea=CC.lea,
                   orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))
kable(
   CC.all[CC.all$Significant > CC.all$Expected,],
   caption = "Most over-represented terms of the Cellular Component ontology.")
Most over-represented terms of the Cellular Component ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
3 GO:0005815 microtubule organizing center 37 2 1.46 9 0.01404 0.0042 0.01404
4 GO:0005885 Arp2/3 protein complex 8 2 0.32 2 0.00811 0.0081 0.00811
kable(
   CC.all[CC.all$Significant < CC.all$Expected,],
   caption = "Most under-represented terms of the Cellular Component ontology.")
Most under-represented terms of the Cellular Component ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
1 GO:0030286 dynein complex 26 0 1.02 1 0.00023 0.0021 0.00023
2 GO:0032991 protein-containing complex 886 20 34.92 29 0.03995 0.0039 0.03286
5 GO:0005680 anaphase-promoting complex 7 0 0.28 7 0.01138 0.0114 0.01138
6 GO:0044428 nuclear part 236 4 9.30 38 0.04776 0.0138 0.07751
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
                             lea=significant.lea,
                             elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
                 Definition=Definition(GOTERM[sigTerms]),
                 PValue=score(CC.weight01)[sigTerms]),
      caption = paste('Cellular component terms significantly associated with', VAR,
                      'according to all 3 algorithms', sep=' '))
Cellular component terms significantly associated with hatching according to all 3 algorithms
Term Definition PValue
GO:0030286 dynein complex Any of several large complexes that contain two or three dynein heavy chains and several light chains, and have microtubule motor activity. 0.0020997
GO:0005885 Arp2/3 protein complex A stable protein complex that contains two actin-related proteins, Arp2 and Arp3, and five novel proteins (ARPC1-5), and functions in the nucleation of branched actin filaments. 0.0081061
showSigOfNodes(GOdata.CC, score(CC.weight01),
               firstSigNodes = sum(significant.elim),
               wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 30 
## Number of Edges = 48 
## 
## $complete.dag
## [1] "A graph with 30 nodes."
showGroupDensity(GOdata.CC, names(score(CC.weight01))[which.min(score(CC.weight01))])

Using the portion of variance explained by hatching

Building the topGO object

I need to generate the topGO objects again, using the alternative gene ordering, based on the proportion of expression-level variance explained by hatching. I miss a way to inform the topGOdata object that the score in this case is reversed, with respect to p-values: the higher it is, the more differentially expressed the gene is. To make sure that GO terms are tested in the same way than when using p-values, I will just reverse the proportion of variance explained by hatching to its complement. Taking this into account, the subset of interesting genes (selection() function) must be defined as the lowest 10% scores, which are the 10% genes with largest portion of variance explained by hatching.

selection <- function(allScores) {return(allScores <= quantile(allScores, probs = 0.10))}
GOdataVar.BP <- new('topGOdata',
   description = 'Differential expression among Brachionus plicatilis populations.',
   ontology = 'BP', 
   allGenes = 1.0 - Variance,
   annot = annFUN.gene2GO,
   gene2GO = allgenes2GO,
   geneSelectionFun = selection,
   nodeSize = 5)

GOdataVar.MF <- new('topGOdata',
   description = 'Differential expression among Brachionus plicatilis populations.',
   ontology = 'MF',
   allGenes = 1.0 - Variance,
   annot = annFUN.gene2GO,
   gene2GO = allgenes2GO,
   geneSelectionFun = selection,
   nodeSize = 5)

GOdataVar.CC <- new('topGOdata',
   description = 'Differential expression among Brachionus plicatilis populations.',
   ontology = 'CC',
   allGenes = 1.0 - Variance,
   annot = annFUN.gene2GO,
   gene2GO = allgenes2GO,
   geneSelectionFun = selection,
   nodeSize = 5)

library(knitr)
DataSummary <- data.frame(ontology = c('BP', 'MF', 'CC'),
   Num_Genes = sapply(list(GOdataVar.BP, GOdataVar.MF, GOdataVar.CC), numGenes),
   Num_GO_terms = sapply(list(GOdataVar.BP, GOdataVar.MF, GOdataVar.CC), function(x) length(usedGO(x))))
kable(DataSummary, caption='Number of feasible genes or transcripts and number of GO terms used in each data set.')
rm(DataSummary)
Number of feasible genes or transcripts and number of GO terms used in each data set.
ontology Num_Genes Num_GO_terms
BP 15162 1154
MF 17429 576
CC 12940 291

Running the tests

BPvar.elim     <- runTest(GOdataVar.BP, algorithm = "elim",     statistic = "ks")
BPvar.weight01 <- runTest(GOdataVar.BP, algorithm = "weight01", statistic = "ks")
BPvar.lea      <- runTest(GOdataVar.BP, algorithm = "lea",      statistic = "ks")
MFvar.elim     <- runTest(GOdataVar.MF, algorithm = "elim",     statistic = "ks")
MFvar.weight01 <- runTest(GOdataVar.MF, algorithm = "weight01", statistic = "ks")
MFvar.lea      <- runTest(GOdataVar.MF, algorithm = "lea",      statistic = "ks")
CCvar.elim     <- runTest(GOdataVar.CC, algorithm = "elim",     statistic = "ks")
CCvar.weight01 <- runTest(GOdataVar.CC, algorithm = "weight01", statistic = "ks")
CCvar.lea      <- runTest(GOdataVar.CC, algorithm = "lea",      statistic = "ks")

ResultsSummary <- data.frame(ontology = rep(c("BP", "MF", "CC"), each = 3),
   algorithm = rep(c("elim", "weight01", "lea"), 3),
   TermsTested = sapply(list(BPvar.elim, BPvar.weight01, BPvar.lea, MFvar.elim, MFvar.weight01, MFvar.lea, CCvar.elim, CCvar.weight01, CCvar.lea), function(X) length(score(X))),
   Significant = sapply(list(BPvar.elim, BPvar.weight01, BPvar.lea, MFvar.elim, MFvar.weight01, MFvar.lea, CCvar.elim, CCvar.weight01, CCvar.lea), function(X) sum(score(X) < 0.01)))

kable(ResultsSummary, caption="Number of non-trivial terms tested and those with a score (not corrected p-value) lower than 0.01.")
Number of non-trivial terms tested and those with a score (not corrected p-value) lower than 0.01.
ontology algorithm TermsTested Significant
BP elim 1154 6
BP weight01 1154 5
BP lea 1154 9
MF elim 576 5
MF weight01 576 4
MF lea 576 7
CC elim 291 1
CC weight01 291 0
CC lea 291 1
rm(ResultsSummary)

Results

Biological process

orderedTerms <- names(sort(score(BPvar.weight01)))
significant.weight01 <- score(BPvar.weight01)[orderedTerms] <= 0.01
significant.lea      <- score(BPvar.lea)[orderedTerms] <= 0.01
significant.elim     <- score(BPvar.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
BP.variance.sigTerms <- sigTerms

BPvar.all <- GenTable(GOdataVar.BP, elim=BPvar.elim, weight01=BPvar.weight01, lea=BPvar.lea,
                      orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))

kable(
   BPvar.all[BPvar.all$Significant > BPvar.all$Expected,],
   caption = "Most over-represented terms of the Biological Process ontology.")
Most over-represented terms of the Biological Process ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
1 GO:0006887 exocytosis 26 4 2.72 26 0.0210 0.0037 0.0210
6 GO:0000902 cell morphogenesis 10 5 1.04 7 0.0101 0.0101 0.0101
kable(
   BPvar.all[BPvar.all$Significant < BPvar.all$Expected,],
   caption = "Most under-represented terms of the Biological Process ontology.")
Most under-represented terms of the Biological Process ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
2 GO:0003341 cilium movement 10 1 1.04 3 0.0068 0.0068 0.0068
3 GO:1901264 carbohydrate derivative transport 11 1 1.15 381 0.3229 0.0085 0.3229
4 GO:0015931 nucleobase-containing compound transport 29 1 3.03 643 0.6196 0.0085 0.6196
5 GO:0046907 intracellular transport 202 19 21.10 458 0.4219 0.0100 0.4356
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
                             lea=significant.lea,
                             elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
                 Definition=Definition(GOTERM[sigTerms]),
                 PValue=score(BPvar.weight01)[sigTerms]),
      caption = paste('Biological process terms significantly associated with', VAR,
                      'according to all 3 algorithms', sep=' '))
Biological process terms significantly associated with hatching according to all 3 algorithms
Term Definition PValue
GO:0003341 cilium movement The directed, self-propelled movement of a cilium. 0.0068139
showSigOfNodes(GOdataVar.BP, score(BPvar.weight01),
               firstSigNodes = sum(significant.elim),
               wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 28 
## Number of Edges = 39 
## 
## $complete.dag
## [1] "A graph with 28 nodes."

Below I plot variance portion, but for the termo found most significant when using p-values, for comparison.

showGroupDensity(GOdataVar.BP, names(score(BP.weight01))[which.min(score(BP.weight01))])

Molecular function

orderedTerms <- names(sort(score(MFvar.weight01)))
significant.weight01 <- score(MFvar.weight01)[orderedTerms] <= 0.01
significant.lea      <- score(MFvar.lea)[orderedTerms] <= 0.01
significant.elim     <- score(MFvar.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
MF.variance.sigTerms <- sigTerms

MFvar.all <- GenTable(GOdataVar.MF, elim=MFvar.elim, weight01=MFvar.weight01, lea=MFvar.lea,
                      orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))
kable(
   MFvar.all[MFvar.all$Significant > MFvar.all$Expected,],
   caption = "Most over-represented terms of the Molecular Function ontology.")
Most over-represented terms of the Molecular Function ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
2 GO:0008417 fucosyltransferase activity 33 8 3.33 3 0.00347 0.00347 0.00347
kable(
   MFvar.all[MFvar.all$Significant < MFvar.all$Expected,],
   caption = "Most under-represented terms of the Molecular Function ontology.")
Most under-represented terms of the Molecular Function ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
1 GO:0005515 protein binding 2101 185 212.04 1 0.00098 0.00072 0.00028
3 GO:0140101 catalytic activity, acting on a tRNA 69 4 6.96 425 0.85535 0.00684 0.85535
4 GO:1901505 carbohydrate derivative transmembrane tr… 22 1 2.22 239 0.50759 0.00790 0.50759
5 GO:0042578 phosphoric ester hydrolase activity 140 9 14.13 179 0.33849 0.01104 0.61057
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
                             lea=significant.lea,
                             elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
                 Definition=Definition(GOTERM[sigTerms]),
                 PValue=score(MFvar.weight01)[sigTerms]),
      caption = paste('Molecular functions terms significantly associated with', VAR,
                      'according to all 3 algorithms', sep=' '))
Molecular functions terms significantly associated with hatching according to all 3 algorithms
Term Definition PValue
GO:0005515 protein binding Interacting selectively and non-covalently with any protein or protein complex (a complex of two or more proteins that may include other nonprotein molecules). 0.0007229
GO:0008417 fucosyltransferase activity Catalysis of the transfer of a fucosyl group to an acceptor molecule, typically another carbohydrate or a lipid. 0.0034656
showSigOfNodes(GOdataVar.MF, score(MFvar.weight01),
               firstSigNodes = sum(significant.elim),
               wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 16 
## Number of Edges = 15 
## 
## $complete.dag
## [1] "A graph with 16 nodes."

I plot variance portion, but for the term found most significant when using p-values, for comparison.

showGroupDensity(GOdataVar.MF, names(score(MF.weight01))[which.min(score(MF.weight01))])

Cellular component

orderedTerms <- names(sort(score(CCvar.weight01)))
significant.weight01 <- score(CCvar.weight01)[orderedTerms] <= 0.01
significant.lea      <- score(CCvar.lea)[orderedTerms] <= 0.01
significant.elim     <- score(CCvar.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
if (length(sigTerms) == 0) {
   sigTerms <- orderedTerms[significant.lea & significant.elim]
}
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
CC.variance.sigTerms <- sigTerms

CCvar.all <- GenTable(GOdataVar.CC, elim=CCvar.elim, weight01=CCvar.weight01, lea=CCvar.lea,
                      orderBy="weight01", ranksOf="elim", topNodes=max(sum(significant.elim), 10))

kable(
   CCvar.all[CCvar.all$Significant > CCvar.all$Expected,],
   caption = "Most over-represented terms of the Cellular Component ontology.")
Most over-represented terms of the Cellular Component ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
3 GO:0044440 endosomal part 12 2 1.27 34 0.062 0.020 0.062
4 GO:0005680 anaphase-promoting complex 7 1 0.74 13 0.031 0.031 0.031
5 GO:0005890 sodium:potassium-exchanging ATPase compl… 5 1 0.53 21 0.044 0.044 0.044
kable(
   CCvar.all[CCvar.all$Significant < CCvar.all$Expected,],
   caption = "Most under-represented terms of the Cellular Component ontology.")
Most under-represented terms of the Cellular Component ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
1 GO:0000123 histone acetyltransferase complex 22 2 2.33 30 0.053 0.011 0.053
2 GO:0032991 protein-containing complex 886 65 93.74 201 0.786 0.017 0.966
6 GO:0032580 Golgi cisterna membrane 8 0 0.85 24 0.046 0.046 0.046
7 GO:0044428 nuclear part 236 20 24.97 66 0.184 0.064 0.957
8 GO:0045211 postsynaptic membrane 16 0 1.69 39 0.067 0.067 0.067
9 GO:0044798 nuclear transcription factor complex 25 0 2.64 52 0.108 0.072 0.108
10 GO:1990234 transferase complex 111 6 11.74 65 0.182 0.074 0.880
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
                             lea=significant.lea,
                             elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
                 Definition=Definition(GOTERM[sigTerms]),
                 PValue=score(CCvar.weight01)[sigTerms]),
      caption = paste('Cellular component terms significantly associated with', VAR,
                      'according to all 3 algorithms', sep=' '))
Cellular component terms significantly associated with hatching according to all 3 algorithms
Term Definition PValue
GO:0005929 cilium A specialized eukaryotic organelle that consists of a filiform extrusion of the cell surface and of some cytoplasmic parts. Each cilium is largely bounded by an extrusion of the cytoplasmic (plasma) membrane, and contains a regular longitudinal array of microtubules, anchored to a basal body. 0.2040169
showSigOfNodes(GOdataVar.CC, score(CCvar.weight01),
               firstSigNodes = sum(significant.weight01),
               wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 28 
## Number of Edges = 44 
## 
## $complete.dag
## [1] "A graph with 28 nodes."

For comparison, I plot the distribution of variance portion for the CC term found most significant when using p-values.

showGroupDensity(GOdataVar.CC, names(score(CC.weight01))[which.min(score(CC.weight01))])

Comparison between the two ordering of genes.

Biological process

allTerms <- usedGO(GOdata.BP)
vennDiagram(vennCounts(cbind(PValue   = allTerms %in% BP.pvalue.sigTerms,
                             Variance = allTerms %in% BP.variance.sigTerms)))

ggplot(data = data.frame(PValue   = rank(score(BP.weight01))[allTerms],
                         Variance = rank(score(BPvar.weight01))[allTerms]),
       mapping = aes(x = PValue, y = Variance)) +
   geom_point() + geom_smooth() + xlab('Using gene p-values') +
   ylab('Using portion of explained variance') +
   ggtitle('Ordering of BP terms by significance')

Molecular function

allTerms <- usedGO(GOdata.MF)
vennDiagram(vennCounts(cbind(PValue   = allTerms %in% MF.pvalue.sigTerms,
                             Variance = allTerms %in% MF.variance.sigTerms)))

ggplot(data = data.frame(PValue   = rank(score(MF.weight01))[allTerms],
                         Variance = rank(score(MFvar.weight01))[allTerms]),
       mapping = aes(x = PValue, y = Variance)) +
   geom_point() + geom_smooth() + xlab('Using gene p-values') +
   ylab('Using portion of explained variance') +
   ggtitle('Ordering of MF terms by significance')

Cellular component

allTerms <- usedGO(GOdata.CC)
vennDiagram(vennCounts(cbind(PValue   = allTerms %in% CC.pvalue.sigTerms,
                             Variance = allTerms %in% CC.variance.sigTerms)))

ggplot(data = data.frame(PValue   = rank(score(CC.weight01))[allTerms],
                         Variance = rank(score(CCvar.weight01))[allTerms]),
       mapping = aes(x = PValue, y = Variance)) +
   geom_point() + geom_smooth() + xlab('Using gene p-values') +
   ylab('Using portion of explained variance') +
   ggtitle('Ordering of CC terms by significance')

Session info

I save the main variables to be able to load them in a new R session later.

save(allgenes2GO,
     GOdata.BP, BP.elim, BP.weight01, BP.lea, BP.pvalue.sigTerms,
     GOdata.MF, MF.elim, MF.weight01, MF.lea, MF.pvalue.sigTerms,
     GOdata.CC, CC.elim, CC.weight01, CC.lea, CC.pvalue.sigTerms,
     GOdataVar.BP, BPvar.elim, BPvar.weight01, BPvar.lea, BP.variance.sigTerms,
     GOdataVar.MF, MFvar.elim, MFvar.weight01, MFvar.lea, MF.variance.sigTerms,
     GOdataVar.CC, CCvar.elim, CCvar.weight01, CCvar.lea, CC.variance.sigTerms,
     file = paste('Enrichment', TAG, VAR, 'RData', sep='.'))
sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=ca_ES.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=ca_ES.UTF-8    
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=ca_ES.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] grid      stats4    parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] Rgraphviz_2.30.0     ggplot2_3.2.1        limma_3.42.0        
##  [4] knitr_1.27           topGO_2.38.1         SparseM_1.78        
##  [7] GO.db_3.10.0         AnnotationDbi_1.48.0 IRanges_2.20.2      
## [10] S4Vectors_0.24.3     Biobase_2.46.0       graph_1.64.0        
## [13] BiocGenerics_0.32.0 
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5   xfun_0.12          purrr_0.3.3        splines_3.6.2     
##  [5] lattice_0.20-38    colorspace_1.4-1   vctrs_0.2.1        htmltools_0.4.0   
##  [9] yaml_2.2.0         mgcv_1.8-31        blob_1.2.1         rlang_0.4.2       
## [13] pillar_1.4.3       glue_1.3.1         withr_2.1.2        DBI_1.1.0         
## [17] bit64_0.9-7        matrixStats_0.55.0 lifecycle_0.1.0    stringr_1.4.0     
## [21] munsell_0.5.0      gtable_0.3.0       memoise_1.1.0      evaluate_0.14     
## [25] labeling_0.3       highr_0.8          Rcpp_1.0.3         backports_1.1.5   
## [29] scales_1.1.0       farver_2.0.3       bit_1.1-15.1       digest_0.6.23     
## [33] stringi_1.4.5      dplyr_0.8.3        tools_3.6.2        magrittr_1.5      
## [37] lazyeval_0.2.2     tibble_2.1.3       RSQLite_2.2.0      crayon_1.3.4      
## [41] pkgconfig_2.0.3    zeallot_0.1.0      Matrix_1.2-18      assertthat_0.2.1  
## [45] rmarkdown_2.1      R6_2.4.1           nlme_3.1-143       compiler_3.6.2